312 research outputs found

    On parameter filtering in continuous subword-unit-based speech recognition

    Simple IIR or FIR filters have been widely used in isolated or connected word recognition tasks to filter the time sequence of speech spectral parameters, since, despite their simplicity, they significantly improve recognition performance. When applied to continuous speech recognition, where phoneme-sized modelling units are used, those filters induce spectral transition spreading and a cross-boundary effect. The authors show how the use of context-dependent units reduces the side effects of the filters and may result in improved recognition performance. When dynamic parameters are not used, filtering seems to be especially useful, even for clean speech; when they are used, the filters perform well under mismatched training and testing conditions.
    Peer reviewed. Postprint (published version).
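As a rough illustration of what "filtering the time sequence of spectral parameters" means, the sketch below applies a causal FIR filter along the time axis of each parameter independently. The function name `fir_filter_trajectories`, the first-difference taps, and the toy frames are assumptions made for this example; the abstract does not state which filters the authors used.

```python
def fir_filter_trajectories(frames, taps):
    """Filter each parameter's time trajectory with a causal FIR filter.

    frames: list of frames, each a list of spectral parameters.
    taps:   FIR coefficients h[0..K-1] (illustrative, not taken from the paper).
    """
    n_frames = len(frames)
    n_params = len(frames[0]) if frames else 0
    out = [[0.0] * n_params for _ in range(n_frames)]
    for t in range(n_frames):
        for k, h in enumerate(taps):
            if t - k < 0:
                break  # frames before the start of the utterance count as zero
            for p in range(n_params):
                out[t][p] += h * frames[t - k][p]
    return out


# Hypothetical data: 4 frames of 2 parameters; the second parameter is constant.
frames = [[1.0, 5.0], [2.0, 5.0], [4.0, 5.0], [4.0, 5.0]]
# A first-difference (high-pass) filter, chosen only for illustration.
filtered = fir_filter_trajectories(frames, taps=[1.0, -1.0])
```

Because each output frame mixes several neighbouring input frames, a longer filter spreads information across unit boundaries, which is one way to picture the cross-boundary effect the abstract mentions.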

    A fast one-pass-training feature selection technique for GMM-based acoustic event detection with audio-visual data

    Acoustic event detection becomes a difficult task, even for a small number of events, in scenarios where events are produced rather spontaneously and often overlap in time. In this work, we aim to improve the detection rate by means of feature selection. Using a one-against-all detection approach, a new fast one-pass-training algorithm and an associated highly precise metric are developed. When a different subset of multimodal features is chosen for each acoustic event class, the results obtained from audiovisual data collected in the UPC multimodal room show an improvement in average detection rate with respect to using the whole set of features.
    Peer reviewed. Preprint.

    A hierarchical architecture with feature selection for audio segmentation in a broadcast news domain

    This work presents a hierarchical HMM-based audio segmentation system with feature selection designed for the Albayzin 2010 Evaluations. We propose an architecture that combines the outputs of individual binary detectors, each trained with a class-dependent feature set adapted to the characteristics of its class. A fast one-pass-training wrapper-based technique was used to perform feature selection, and an improvement in average accuracy with respect to using the whole set of features is reported.
    Peer reviewed. Postprint (published version).

    Detection of overlapped acoustic events using fusion of audio and video modalities

    Acoustic event detection (AED) may help to describe acoustic scenes, and also contribute to improving the robustness of speech technologies. Even if the number of considered events is not large, detection becomes a difficult task in scenarios where the acoustic events (AEs) are produced rather spontaneously and often overlap in time with speech. In this work, fusion of audio and video information at either feature or decision level is performed, and the results are compared for different levels of signal overlap. The best improvement with respect to an audio-only baseline system was obtained using the feature-level fusion technique. Furthermore, a significant recognition rate improvement is observed where the AEs are overlapped with loud speech, mainly because the video modality remains unaffected by the interfering sound.
    Peer reviewed. Postprint (published version).

    On the potential of channel selection for recognition of reverberated speech with multiple microphones

    The performance of ASR systems in a room environment with distant microphones is strongly affected by reverberation. As the degree of signal distortion varies among acoustic channels (i.e. microphones), the recognition accuracy can benefit from a proper channel selection. In this paper, we experimentally show that there exists a large margin for WER reduction by channel selection, and discuss several possible methods which do not require any a priori classification. Moreover, using an LVCSR task, a significant WER reduction is shown with a simple technique which uses a measure computed from the sub-band time envelope of the various microphone signals.
    Peer reviewed. Preprint.

    Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion

    Recently, audio segmentation has attracted research interest because of its usefulness in several applications such as audio indexing and retrieval, subtitling, and monitoring of acoustic scenes. Moreover, a preliminary audio segmentation stage may be useful to improve the robustness of speech technologies like automatic speech recognition and speaker diarization. In this article, we present the evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign. That evaluation consisted of segmenting audio from the 3/24 Catalan TV channel into five acoustic classes: music, speech, speech over music, speech over noise, and other. The evaluation results showed the difficulty of this segmentation task. In this article, after presenting the database and metric, as well as the feature extraction methods and segmentation techniques used by the submitted systems, the experimental results are analyzed and compared, with the aim of gaining insight into the proposed solutions and identifying promising directions.
    Peer reviewed. Postprint (published version).

    A simple spectrum estimation technique based on the analytic cepstrum

    Peer reviewed. Postprint (published version).

    Les tecnologies de la parla: lloc de trobada, difícil però necessària, entre lingüística i tecnologia [Speech technologies: a difficult but necessary meeting place between linguistics and technology]

    Research and development of applications in speech technologies have largely set linguistic knowledge aside. The reasons are varied, as we will see, but the difficulty of joint work between linguists and technologists does not mean that such work is unnecessary for advancing further toward the goals of the technology itself.
    Postprint (author's final draft).

    Frequency averaging: a useful multiwindow spectral analysis approach

    The multiwindow approach is a meaningful framework for nonparametric spectral estimation. It also encompasses several conventional methods, such as WOSA and the frequency-averaged periodogram. Recently, some authors claimed that the Slepian windows of Thomson's method and other related optimal sets of windows show better performance in terms of resolution, variance and leakage. In this paper, that claim is discussed by means of some simulation examples and by applying the various methods to speech recognition. In conclusion, frequency averaging of the periodogram is a computationally simple method that offers great flexibility for band specification and shows comparatively good performance. In fact, it is the spectral analysis technique most extensively employed for speech recognition.
    Peer reviewed. Postprint (published version).
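A minimal sketch of a frequency-averaged periodogram may help make the idea concrete: compute the periodogram of a frame, then average its bins inside each band. The function names, the band edges, and the test signal are assumptions chosen for this example, and a direct DFT is used only to keep the sketch dependency-free; this is not the authors' implementation.

```python
import cmath
import math


def periodogram(x):
    """Periodogram |X[k]|^2 / N for k = 0..N//2, computed with a direct DFT."""
    n = len(x)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec


def frequency_averaged(spec, bands):
    """Average the periodogram bins inside each (start, stop) band; stop is exclusive."""
    return [sum(spec[a:b]) / (b - a) for a, b in bands]


# Hypothetical example: a cosine centred exactly on DFT bin 2 of a 16-sample frame.
n = 16
x = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
spec = periodogram(x)
# Band edges are arbitrary here; in practice they could follow any band layout.
bands = [(0, 3), (3, 6), (6, 9)]
smoothed = frequency_averaged(spec, bands)
```

Since the bands are just index ranges, their widths and positions can be set freely, which reflects the flexibility for band specification noted in the abstract.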